Georgia Institute of Technology - EpiDetector

 

VAST 2010 Challenge

Hospitalization Records – Characterization of Pandemic Spread

 

Authors and Affiliations:

 

Jaeyeon Kihm, Georgia Institute of Technology, jkihm3@gatech.edu [PRIMARY contact]

Jaegul Choo, Georgia Institute of Technology, joyfull@cc.gatech.edu

Carsten Gorg, Georgia Institute of Technology, goerg@cc.gatech.edu

Hanseung Lee, Georgia Institute of Technology, hanseung.lee@gatech.edu

Zhicheng Liu, Georgia Institute of Technology, zliu6@cc.gatech.edu

Heasun Park, Georgia Institute of Technology, hpark@cc.gatech.edu

John Stasko, Georgia Institute of Technology, stasko@cc.gatech.edu

 

Tool(s):

 

EpiDetector, our visual analytics tool for VAST 2010 Mini Challenge 2, visualizes hospitalization records across the cities involved in the epidemic. We developed it at the Georgia Institute of Technology for this challenge. It lets users analyze how an epidemic outbreak unfolds across cities and which syndromes are causing deaths. The application has three interactive views:

·        Main frame: The user can select either a city/location or a syndrome for filtering the data relevant to the selection.

·        Overview of hospitalization records: An upper graph shows a timeline with the number of admitted patients per date or the number of patients who were admitted and died on a date, along with the causing medical syndrome. A lower graph shows mortalities and the number of days from the patient’s admittance to their death.

·        Syndrome composition view: This window shows the composition of syndromes and the list of symptoms for a set of mortalities.

 

NOTE: In this project, a “symptom” is an individual patient’s complaint as listed in the given data files. We aggregate similar symptoms into eight primary “syndromes”.

For example, “vomiting, diarrhea” is a symptom, whereas “gastrointestinal” is one of the eight syndromes used in this project.

 

Video:

[Play VideoMC2]

 

 

ANSWERS:


MC2.1: Analyze the records you have been given to characterize the spread of the disease.  You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease.  Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak.  They are looking for visualization tools that will save them analysis time so they can react quickly.

 

1.      Method

The hospitalization records list the patient id, admission date, age, gender, and symptoms of each patient; the death records list the patient id and the date of death. Since characterizing the spread of the disease is the main focus of this challenge, describing which syndrome occurred where and when is of vital importance. First, we needed to clean the noisy, free-text symptoms, which contain many abbreviations and misspelled words. We decided to classify all the different symptoms into the 8 syndromes monitored by the Real-time Outbreak and Disease Surveillance (RODS) system at the University of Pittsburgh. We felt that 8 simple syndromes would be more useful for recognizing the epidemic outbreak than the full variety of individual symptoms.

 

We used three different approaches to pre-process the free-text symptoms. First, we split pairs of distinct symptoms that had been fused into one word, such as “diarrheavomiting”. To do this, we scanned each symptom string one letter at a time, checking whether the string read so far was a valid term in a medical dictionary. Second, we detected duplicated words such as “headacheheadache” using a similar approach, discarding the duplicates. Finally, we expanded abbreviations such as “ab” or “abd” for “abdomen” by checking them against known lists. This preprocessing cleaned up many ambiguous symptoms.
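
The three pre-processing steps can be sketched roughly as follows. This is a minimal illustration, not the code used in EpiDetector: the vocabulary `MEDICAL_TERMS`, the table `ABBREVIATIONS`, and the function names are assumptions made for the example, standing in for the medical dictionary and abbreviation lists mentioned above.

```python
# Hypothetical stand-ins for the medical dictionary and the abbreviation lists;
# the actual resources used for the challenge entry are not given in the text.
MEDICAL_TERMS = {"diarrhea", "vomiting", "headache", "abdomen", "pain", "fever"}
ABBREVIATIONS = {"ab": "abdomen", "abd": "abdomen"}


def split_fused(word, vocabulary):
    """Split a fused token into dictionary terms, growing a prefix one letter at a time.

    Returns the list of terms, or None if the word cannot be fully segmented.
    """
    if not word:
        return []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in vocabulary:
            rest = split_fused(word[end:], vocabulary)
            if rest is not None:
                return [prefix] + rest
    return None


def clean_symptom(text):
    """Normalize one free-text symptom string into a list of distinct terms."""
    terms = []
    for token in text.lower().replace(",", " ").split():
        token = ABBREVIATIONS.get(token, token)    # expand known abbreviations
        parts = split_fused(token, MEDICAL_TERMS)  # split fused words
        if parts is None:
            parts = [token]                        # unknown word: keep it unchanged
        for part in parts:
            if part not in terms:                  # discard duplicated terms
                terms.append(part)
    return terms
```

With this sketch, `clean_symptom("diarrheavomiting")` yields the two separate terms, and `clean_symptom("headacheheadache")` yields a single `"headache"`. Misspelled words would need an additional fuzzy-matching step that is not shown here.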

 

Next, we created a rule-based classifier using the RODS syndromic definitions to determine which symptoms belonged to which syndromes.
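
A rule-based classifier of this kind might look like the sketch below. The keyword sets are illustrative assumptions, not the RODS syndromic definitions, and only “gastrointestinal” and “other” are syndrome names taken from this document; the “respiratory” rule is included purely as an example of a second rule.

```python
# Illustrative keyword rules only. The keyword lists and the "respiratory" entry
# are assumptions; the real RODS definitions are considerably more detailed.
SYNDROME_RULES = [
    ("gastrointestinal", {"vomiting", "diarrhea", "abdomen", "nausea"}),
    ("respiratory", {"cough", "wheezing"}),
]


def classify(terms):
    """Return the first syndrome whose keywords overlap the cleaned terms, else 'other'."""
    term_set = set(terms)
    for syndrome, keywords in SYNDROME_RULES:
        if keywords & term_set:
            return syndrome
    return "other"
```

Because the rules are checked in order, a record matching several syndromes is assigned to the first match; a real rule set would need an explicit priority or a multi-label scheme.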

 

Once the data was cleaned and the symptoms classified into the 8 categories, we built a visualization tool to read the data and show the results.  Figure 1 shows a screenshot from the application.  The top graph shows overall patient admittances to a hospital, and the lower region shows the mortalities.

 

2.      Mortality rates

We analyzed the mortality rate of each syndrome by showing the number of patients who died (Figure 1). We created a bar chart in which each patient who died is positioned on the x-axis at the day they entered the hospital and is colored by the number of days they spent in the hospital before dying. Figure 1 shows this pattern for Iran, and one can clearly see the rise in the number of deaths in the middle of the period. Most of the bars are also composed of large red regions, which correspond to patients who were in the hospital for eight days before passing away. To see which of the syndromes are most connected to the epidemic, we compared the pattern of each syndrome in the upper and lower graphs of Figure 1. Since the pattern of the gastrointestinal syndrome in the upper graph is most similar to the trend of the epidemic in the lower graph, we suspected that the gastrointestinal syndrome and the epidemic might be strongly correlated. Specifically, for May 18, 2009, the gastrointestinal syndrome dominates among the patients who died after eight days (Figure 2). Moreover, the most frequent symptoms in the death records were vomiting (265 occurrences), pain (161), abdomen (122), diarrhea (117), and fever (81) (Figure 2). The same pattern of death records was also found in the other cities except for Turkey and Thailand (Figure 3), so we expect that the epidemic is infectious and may have been transmitted to all cities except those two.
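
The data behind the lower bar chart can be derived by joining the admission and death records on patient id. The following is a minimal sketch under the assumption that both record sets have already been loaded into dictionaries keyed by patient id; the function name and the dictionary layout are hypothetical.

```python
from collections import Counter
from datetime import date


def days_to_death_by_admission(admissions, deaths):
    """Count deaths per (admission date, days in hospital before death) pair.

    admissions: dict mapping patient id -> admission date
    deaths:     dict mapping patient id -> death date
    Death records without a matching admission record are skipped.
    """
    counts = Counter()
    for patient_id, died in deaths.items():
        admitted = admissions.get(patient_id)
        if admitted is None:
            continue
        counts[(admitted, (died - admitted).days)] += 1
    return counts
```

Each key corresponds to one colored segment of a bar: the admission date selects the bar, and the number of days selects the color.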

 

3.      Outbreak Pattern across cities

We found that the disease typically causes patients to die 8 days after admission. By analyzing the epidemic pattern of each city, we could see the outbreak pattern across cities (Table 1).

 

 

Location        Onset       Peak      Recovery    Outbreak duration
Nairobi         April 20    May 14    June 16     58 days
Lebanon         April 22    May 16    June 18     58 days
Venezuela       April 22    May 18    June 19     59 days
Aleppo          April 24    May 15    June 17     55 days
Yemen           April 24    May 17    June 18     56 days
Karachi         April 24    May 17    June 18     56 days
Iran            April 24    May 18    June 20     58 days
Saudi Arabia    April 25    May 18    June 20     57 days
Colombia        April 26    May 20    June 19     55 days

Table 1. Key timings and statistics about the disease outbreak in different locations.

 

We set the onset to be the date of the first suspected death, the peak to be the day with the most suspected deaths, and the recovery to be the last day on which a suspected death occurred. The outbreak duration counts the days from onset to recovery, inclusive. Locations are sorted in the table by onset date.
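
These definitions translate directly into a short computation. The sketch below assumes the input is already restricted to the dates of suspected deaths for one location (how a death is flagged as suspected is not part of this example), and the function name is hypothetical. The duration counts both the onset and recovery days, which is consistent with the values in Table 1.

```python
from collections import Counter
from datetime import date


def outbreak_timing(death_dates):
    """Return (onset, peak, recovery, duration in days) for one location.

    death_dates: non-empty iterable of dates, one entry per suspected death.
    """
    counts = Counter(death_dates)
    if not counts:
        raise ValueError("at least one suspected death is required")
    onset = min(counts)      # date of the first suspected death
    recovery = max(counts)   # last day on which a suspected death occurred
    # Day with the most suspected deaths; the earliest such day wins on ties.
    peak = max(sorted(counts), key=lambda day: counts[day])
    duration = (recovery - onset).days + 1  # inclusive of both end days
    return onset, peak, recovery, duration
```

For instance, an onset of April 20, 2009 and a recovery of June 16, 2009 give a duration of 58 days, matching the Nairobi row.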

 

4.      Future Epidemic Outbreak

The epidemic appeared to begin in Nairobi, Kenya, with early onset also in Venezuela and Lebanon. It then quickly spread to Syria, Yemen, Pakistan, Iran, Saudi Arabia, and Colombia. One might expect that neighboring countries would soon be at risk.


MC2.2:  Compare the outbreak across cities.  Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities.  Identify any anomalies you found.

 

In MC2.1, we characterized the timing of the epidemic in different cities. The onsets of the outbreaks were very close, spread over just a few days. All cities also had very similar recovery and duration times, except, of course, for Turkey and Thailand. Aleppo, Syria and Colombia seemed to recover slightly faster than the other cities. Using our application, we could obtain the number of people infected and a mortality rate by checking the number of deaths and the total number of patients across all dates in each city. Table 2 shows the results and indicates that Saudi Arabia had the lowest mortality rate and Aleppo the highest. Referring to Table 1, we might also say that Colombia has the best recovery ability, with the shortest outbreak duration (55 days, the same as Aleppo), even though its death rate in Table 2 was not particularly low. Again, Thailand and Turkey do not show the pattern of patients dying after 8 days, and the main cause of death in these locations was the syndrome labeled “other”.

 

 

Location        Death rate          Death rate of people infected    Number of
                (# of deaths /      (# of deaths infected /          deaths infected
                # of patients)      # of patients)
Aleppo          3.51%               3.44%                            78672
Colombia        2.32%               2.23%                            16338
Iran            2.20%               2.12%                            11926
Karachi         2.31%               2.24%                            165605
Lebanon         1.73%               1.66%                            7646
Nairobi         3.40%               3.34%                            43959
Saudi Arabia    1.62%               1.54%                            21529
Venezuela       2.27%               2.20%                            3717
Yemen           2.56%               2.48%                            7711

Table 2. The number of deaths infected and the death rate of each city
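
The rates in Table 2 are simple ratios of deaths to patients. As a small sketch, the helper below computes such a rate as a percentage, and the dictionary copies the overall death rates from Table 2 to confirm which cities have the lowest and highest values. The names `death_rate` and `TABLE2_DEATH_RATES` are chosen for this example only.

```python
def death_rate(deaths, patients):
    """Return the death rate as a percentage of the total number of patients."""
    if patients <= 0:
        raise ValueError("patients must be positive")
    return 100.0 * deaths / patients


# Overall death rates (in percent) copied from Table 2.
TABLE2_DEATH_RATES = {
    "Aleppo": 3.51, "Colombia": 2.32, "Iran": 2.20,
    "Karachi": 2.31, "Lebanon": 1.73, "Nairobi": 3.40,
    "Saudi Arabia": 1.62, "Venezuela": 2.27, "Yemen": 2.56,
}

lowest = min(TABLE2_DEATH_RATES, key=TABLE2_DEATH_RATES.get)
highest = max(TABLE2_DEATH_RATES, key=TABLE2_DEATH_RATES.get)
```

Here `lowest` is Saudi Arabia and `highest` is Aleppo, in agreement with the discussion of Table 2.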

 

 

 

Figures:

 

[open original figure 1]

Figure 1. Location overview of hospitalization records - Iran

 

[open original figure 2]

Figure 2. Syndrome composition view

 

[open original figure 3]

Figure 3. Location overview of hospitalization records - Turkey